Purpose

Trying out DMM (Dirichlet multinomial mixture) clustering using this workflow (https://microbiome.github.io/tutorials/DMM.html).

Set-up

library(here)
## here() starts at /Users/bowiek/Library/CloudStorage/OneDrive-OregonHealth&ScienceUniversity/MrOS
library(phyloseq)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(microshades)
library(patchwork)
library(ggpubr)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(microbiome)
## 
## microbiome R package (microbiome.github.com)
##     
## 
## 
##  Copyright (C) 2011-2022 Leo Lahti, 
##     Sudarshan Shetty et al. <microbiome.github.io>
## 
## Attaching package: 'microbiome'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## The following object is masked from 'package:base':
## 
##     transform
library(DirichletMultinomial)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:tidyr':
## 
##     expand
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:microbiome':
## 
##     coverage
## The following object is masked from 'package:rstatix':
## 
##     desc
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:phyloseq':
## 
##     distance
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract

Data origin

This ps object was cleaned up in Box/Kate/MrOS/data/Analyses/MrOS_EDA.Rmd. All of the data is described explicitly in that Rmd. The ps object is rarefied.

Code from Erin Dahl, created 04/05/22 and sent to me in an email, labeled DMM_MrOs.html.

load(here("data", "ps_subcat.RData"))

# Creating a variable with an untouched ps object
ps_orig <- ps

Agglomerating to Genus

ps <- tax_glom(ps, "Genus")
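Agglomeration can shrink the taxa table considerably; a quick before/after check (a sketch assuming the `ps` and `ps_orig` objects from above) makes the reduction explicit:

```r
# compare taxon counts before and after agglomerating to Genus
ntaxa(ps_orig)  # ASV/OTU-level count
ntaxa(ps)       # one taxon per Genus after tax_glom()
```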

DMM!

Identify Core Taxa

# To speed up, only consider the core taxa
# that are detected at 0.01% relative abundance in at least 10% of the samples
# (note that this is not strictly correct as information is
# being discarded; one alternative would be to aggregate rare taxa)

ps_comp <- microbiome::transform(ps_orig, "compositional") # convert all abundances to relative abundances

taxa <- core_members(ps_comp, detection = 0.01/100, prevalence = 10/100) # core taxa: >= 0.01% relative abundance in >= 10% of samples

ps_sub <- prune_taxa(taxa, ps)

taxa_names(ps_sub) <- data.frame(ps_sub@tax_table)$Genus # rename taxa to their Genus labels (note: this errors if two core taxa share a Genus)
  
# Pick the OTU count matrix
# and convert it into samples x taxa format
dat <- abundances(ps_sub)
count <- as.matrix(t(dat))

# remove any samples (rows) with a zero total count
count_fixed <- count[rowSums(count) > 0, ]
count_removed <- count[rowSums(count) == 0, ]

# Samples removed (NULL means no samples were dropped)
rownames(count_removed)
## NULL
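Since `rownames()` returns NULL when no rows match, counting rows directly is a less ambiguous check (a sketch using the matrices defined above):

```r
# sanity check on the zero-count filter
nrow(count)          # samples before filtering
nrow(count_fixed)    # samples retained
nrow(count_removed)  # samples dropped (0 here, hence the NULL rownames)
```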

We start with 571 genera; core filtering at 20% prevalence reduces this to 34 taxa, while 10% prevalence retains 54 taxa.
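The taxon counts quoted above can be reproduced by varying the `prevalence` argument (a sketch assuming the `ps_comp` and `ps` objects from above):

```r
ntaxa(ps)  # genus-level taxa before core filtering
# core taxa at 20% vs 10% prevalence, 0.01% relative-abundance detection
length(core_members(ps_comp, detection = 0.01/100, prevalence = 20/100))
length(core_members(ps_comp, detection = 0.01/100, prevalence = 10/100))
```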

Fit DMM Model

Setting the seed keeps the model fits consistent across runs. There are still slight variations, but the results are largely the same.

fit <- lapply(1:12, dmn, seed = 120821, count = count_fixed, verbose=TRUE)
## dmn, k=1
##   Soft kmeans
##   Expectation Maximization setup
##   Expectation Maximization
##   Hessian
## dmn, k=2
##   Soft kmeans
##     iteration 10 change 0.006313
##     iteration 20 change 0.003247
##     iteration 30 change 0.010098
##     iteration 40 change 0.016607
##     iteration 50 change 0.000195
##     iteration 60 change 0.000001
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 3.537368
##     iteration 20 change 9.402401
##     iteration 30 change 0.000003
##   Hessian
## dmn, k=3
##   Soft kmeans
##     iteration 10 change 0.016840
##     iteration 20 change 0.001601
##     iteration 30 change 0.000007
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 1.609024
##     iteration 20 change 0.000478
##     iteration 30 change 0.000001
##   Hessian
## dmn, k=4
##   Soft kmeans
##     iteration 10 change 0.028317
##     iteration 20 change 0.000251
##     iteration 30 change 0.000005
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 3.994406
##     iteration 20 change 5.707279
##     iteration 30 change 0.121941
##     iteration 40 change 0.000081
##   Hessian
## dmn, k=5
##   Soft kmeans
##     iteration 10 change 0.016981
##     iteration 20 change 0.000167
##     iteration 30 change 0.000001
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 4.542790
##     iteration 20 change 6.254182
##     iteration 30 change 0.026815
##     iteration 40 change 0.000031
##   Hessian
## dmn, k=6
##   Soft kmeans
##     iteration 10 change 0.024778
##     iteration 20 change 0.000042
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 13.296472
##     iteration 20 change 0.316766
##     iteration 30 change 0.194911
##     iteration 40 change 1.406707
##     iteration 50 change 7.190685
##     iteration 60 change 0.046072
##     iteration 70 change 0.000835
##     iteration 80 change 0.000156
##     iteration 90 change 0.000069
##     iteration 100 change 0.000047
##   Hessian
## dmn, k=7
##   Soft kmeans
##     iteration 10 change 0.018533
##     iteration 20 change 0.041837
##     iteration 30 change 0.000038
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 12.245453
##     iteration 20 change 14.909620
##     iteration 30 change 5.048033
##     iteration 40 change 0.260315
##     iteration 50 change 0.223842
##     iteration 60 change 0.238205
##     iteration 70 change 0.069831
##     iteration 80 change 0.008170
##     iteration 90 change 0.019363
##     iteration 100 change 0.001550
##   Hessian
## dmn, k=8
##   Soft kmeans
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 7.511017
##     iteration 20 change 0.628649
##     iteration 30 change 0.704308
##     iteration 40 change 0.524228
##     iteration 50 change 0.035005
##     iteration 60 change 0.127707
##     iteration 70 change 0.004504
##     iteration 80 change 0.003741
##     iteration 90 change 0.001009
##     iteration 100 change 0.000106
##   Hessian
## dmn, k=9
##   Soft kmeans
##     iteration 10 change 0.051147
##     iteration 20 change 0.023488
##     iteration 30 change 0.000211
##     iteration 40 change 0.000002
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 27.770269
##     iteration 20 change 17.282921
##     iteration 30 change 0.485548
##     iteration 40 change 0.009255
##     iteration 50 change 0.000088
##     iteration 60 change 0.000001
##   Hessian
## dmn, k=10
##   Soft kmeans
##     iteration 10 change 0.079910
##     iteration 20 change 0.001438
##     iteration 30 change 0.000040
##     iteration 40 change 0.000001
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 19.629779
##     iteration 20 change 1.011632
##     iteration 30 change 0.216636
##     iteration 40 change 0.000073
##   Hessian
## dmn, k=11
##   Soft kmeans
##     iteration 10 change 0.013396
##     iteration 20 change 0.000629
##     iteration 30 change 0.000007
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 29.439336
##     iteration 20 change 2.329879
##     iteration 30 change 0.804719
##     iteration 40 change 0.141667
##     iteration 50 change 5.217509
##     iteration 60 change 2.807502
##     iteration 70 change 0.473148
##     iteration 80 change 0.026707
##     iteration 90 change 0.081518
##     iteration 100 change 0.000103
##   Hessian
## dmn, k=12
##   Soft kmeans
##     iteration 10 change 0.017565
##     iteration 20 change 0.000485
##     iteration 30 change 0.000205
##     iteration 40 change 0.000076
##     iteration 50 change 0.000025
##     iteration 60 change 0.000008
##     iteration 70 change 0.000003
##   Expectation Maximization setup
##   Expectation Maximization
##     iteration 10 change 16.471662
##     iteration 20 change 44.193673
##     iteration 30 change 3.532473
##     iteration 40 change 0.178164
##     iteration 50 change 0.104794
##     iteration 60 change 1.252516
##     iteration 70 change 0.024182
##     iteration 80 change 0.022576
##     iteration 90 change 0.091442
##     iteration 100 change 0.021216
##   Hessian
# runif(1, 0, .Machine$integer.max)
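Fitting all twelve models serially is the slow step. If runtime becomes an issue, the `lapply()` call could be swapped for `parallel::mclapply()` (a sketch, not run here; `mclapply()` relies on forking and so falls back to serial execution on Windows):

```r
library(parallel)

# fit k = 1..12 in parallel; the fixed seed keeps each fit reproducible,
# as in the serial version above
fit_par <- mclapply(1:12, dmn, seed = 120821, count = count_fixed,
                    verbose = FALSE, mc.cores = 4)
```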

Evaluate model fit

# Methods for evaluating the fit of the model
lplc <- base::sapply(fit, DirichletMultinomial::laplace) # AIC / BIC / Laplace
aic  <- base::sapply(fit, DirichletMultinomial::AIC) # AIC / BIC / Laplace
bic  <- base::sapply(fit, DirichletMultinomial::BIC) # AIC / BIC / Laplace

# Plot model fit evaluation
plot(lplc, type="b", xlab="Number of Dirichlet Components", ylab="Model Fit")
lines(aic, type="b", lty = 2)
lines(bic, type="b", lty = 3)

Lisa instructed me to look at clusters 4-12. According to this diagram, k = 8 is at the minimum, meaning it has the best fit. However, according to Paul Spellman, the fewer clusters the better; he typically looks for where the improvement between successive fits drops to about 0, which points to 4-6 clusters.
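Both selection rules can be read off programmatically (a sketch assuming the `lplc` vector from the chunk above):

```r
# global minimum of the Laplace curve: the "best" fit (k = 8 here)
which.min(lplc)

# Spellman's heuristic: stop where successive improvements approach 0
round(diff(lplc), 2)
```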

Picking model

fit_4 <- fit[[4]]
fit_5 <- fit[[5]]
fit_6 <- fit[[6]]
fit_7 <- fit[[7]]
fit_8 <- fit[[8]]
fit_9 <- fit[[9]]
fit_10 <- fit[[10]]
fit_11 <- fit[[11]]
fit_12 <- fit[[12]]

Interpretation:

assignments_4 <- data.frame(apply(mixture(fit_4), 1, which.max))
colnames(assignments_4) <- "k_4"

assignments_5 <- data.frame(apply(mixture(fit_5), 1, which.max))
colnames(assignments_5) <- "k_5"
assignments_6 <- data.frame(apply(mixture(fit_6), 1, which.max))
colnames(assignments_6) <- "k_6"

assignments_7 <- data.frame(apply(mixture(fit_7), 1, which.max))
colnames(assignments_7) <- "k_7"
assignments_8 <- data.frame(apply(mixture(fit_8), 1, which.max))
colnames(assignments_8) <- "k_8"
assignments_9 <- data.frame(apply(mixture(fit_9), 1, which.max))
colnames(assignments_9) <- "k_9"
assignments_10 <- data.frame(apply(mixture(fit_10), 1, which.max))
colnames(assignments_10) <- "k_10"

assignments_11 <- data.frame(apply(mixture(fit_11), 1, which.max))
colnames(assignments_11) <- "k_11"
assignments_12 <- data.frame(apply(mixture(fit_12), 1, which.max))
colnames(assignments_12) <- "k_12"

# merge all assignment data frames by row names, one k at a time
merge_by_rownames <- function(x, y) {
  m <- merge(x, y, by = "row.names")
  rownames(m) <- m$Row.names
  m$Row.names <- NULL
  m
}

dmm_classification <- Reduce(merge_by_rownames,
                             list(assignments_4, assignments_5, assignments_6,
                                  assignments_7, assignments_8, assignments_9,
                                  assignments_10, assignments_11, assignments_12))

Add cluster info to sample data

temp_sam <- data.frame(ps_orig@sam_data)

new_sam <- merge(temp_sam, dmm_classification, by = "row.names", all=TRUE)
rownames(new_sam) <- new_sam$Row.names
new_sam$Row.names <- NULL
sample_data(ps_orig) <- new_sam
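With the assignments merged into the sample data, cluster sizes are easy to tabulate (a sketch using the k = 6 solution; any of the `k_*` columns works the same way):

```r
# number of samples assigned to each DMM component
table(sample_data(ps_orig)$k_6, useNA = "ifany")
```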

Plotting!

mdf <- ps_orig %>% 
          psmelt()
## Warning in psmelt(.): The sample variables: 
## Sample
##  have been renamed to: 
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.

4 clusters

color_dfs <- create_color_dfs(mdf, group_level = "Phylum", subgroup_level = "Genus")
  
color_dfs <- reorder_samples_by(color_dfs$mdf,color_dfs$cdf, order_tax = "Firmicutes")

plot_g <- plot_microshades(color_dfs$mdf,color_dfs$cdf)
## Warning in is.na(cdf$hex) || is.na(cdf$group): 'length(x) = 25 > 1' in coercion
## to 'logical(1)'

## Warning in is.na(cdf$hex) || is.na(cdf$group): 'length(x) = 25 > 1' in coercion
## to 'logical(1)'
plot_g + facet_grid(~k_4, scales = "free_x")

#temp <- data.frame(ps@tax_table)

Overall, the clusters look similar in size, although cluster 1 is maybe a little smaller. Cluster 1 is primarily Firmicutes (Staphylococcus), cluster 2 is Firmicutes (other) mixed with Actinobacteria, cluster 3 is a mixture of several phyla, and the last cluster is mainly Proteobacteria (Neisseria). Cluster 3 is a little hard to define, but is probably a mixed cluster.

5 clusters

plot_g + facet_grid(~k_5, scales = "free_x")

The first two clusters look similar to the k = 4 graph above. Cluster 3 is interesting because it is primarily Proteobacteria and has the fewest Firmicutes. Cluster 4, on the other hand, has the most Firmicutes and seems to be mixed with Actinobacteria. Cluster 5 is primarily Firmicutes (other) and Bacteroidetes. I definitely like these groupings better than k = 4.

6 clusters

plot_g + facet_grid(~k_6, scales = "free_x")

According to the fit graph, this is the second-best fit, with k = 8 being the best. The first two clusters look similar. There is still a Proteobacteria-dominated cluster (4) and a Bacteroidetes cluster that also has "other" Firmicutes (3). What's added is cluster 6, which is dominated by E. coli but is made up of only 6 samples, which gives me pause.

mixturewt(fit_6)
        pi     theta
 0.2974321  2.253109
 0.2018315  6.500637
 0.1829721 20.323644
 0.1688639  9.880016
 0.1321411 10.400118
 0.0167593 24.833727

I added the pi/theta measurements here. Pi is the proportion of samples assigned to each cluster; theta is the Dirichlet precision, where a higher theta means the samples within a cluster are more similar to each other.
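Because `mixturewt()` returns a plain data frame, the components can be ranked by size or homogeneity directly (a sketch using `fit_6` from above):

```r
wt <- mixturewt(fit_6)
wt[order(-wt$pi), ]     # largest clusters first
wt[order(-wt$theta), ]  # tightest (most homogeneous) clusters first
```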

7 clusters

plot_g + facet_grid(~k_7, scales = "free_x")

This is interesting because, according to the fit evaluation, the fit gets slightly worse with 7 clusters. It also has two of these smaller clusters dominated by specific taxa: one by E. coli, and a Bacteroidetes one that otherwise has mostly Firmicutes (other).

mixturewt(fit_7)
        pi     theta
 0.3531051  2.489032
 0.2043463  8.675721
 0.1502068  9.878811
 0.1382811  9.552434
 0.1225539 32.661364
 0.0167699 24.832315
 0.0147370 51.866208

8 clusters

plot_g + facet_grid(~k_8, scales = "free_x") + theme(legend.position = "bottom", axis.text.x = element_blank(), legend.title = element_blank(), legend.text = element_text(size = 12))

According to the fit evaluation graph, this is the "optimal" number of clusters. The last two clusters do not have as many samples as the others, and they look very similar to each other. I find cluster 6 interesting because it has very few Firmicutes, and cluster 5 because it has a lot of Firmicutes, but specifically from other genera (I wonder if the genera are the same, though we can't actually visualize that here; I suppose this is where contribution plots will come in handy).

mixturewt(fit_8)
        pi     theta
 0.2649356  2.067840
 0.1512581  6.996047
 0.1364068  9.222521
 0.1265088  7.127429
 0.1187904 10.097499
 0.0903646 15.676812
 0.0612442 79.621443
 0.0504915 16.303491
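The contribution plots mentioned above can be sketched from `fitted()`, which for a DMN fit returns the taxa-by-component Dirichlet parameter matrix; this follows the microbiome tutorial's driver-taxa plot (a sketch for `fit_8`, keeping only the top 20% of taxa per component):

```r
# plot the top driver taxa for each of the 8 components
d <- reshape2::melt(fitted(fit_8))
colnames(d) <- c("Genus", "cluster", "value")

for (k in seq_len(ncol(fitted(fit_8)))) {
  dk <- subset(d, cluster == k) %>%
    arrange(value) %>%
    mutate(Genus = factor(Genus, levels = unique(Genus))) %>%
    filter(abs(value) > quantile(abs(value), 0.8))
  p <- ggplot(dk, aes(x = Genus, y = value)) +
    geom_col() +
    coord_flip() +
    labs(title = paste("Top drivers: cluster", k))
  print(p)
}
```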

9 clusters

plot_g + facet_grid(~k_9, scales = "free_x")

Here we get back to clusters with very few samples (clusters 7-9), which I think are kind of funky and I'm not into them, although they do have very high theta values (i.e., very homogeneous samples).

mixturewt(fit_9)
        pi     theta
 0.2438355  2.050104
 0.1728209  7.056511
 0.1465091  9.423708
 0.1331145 10.218135
 0.1319222 10.700293
 0.1214418 32.825735
 0.0188243  7.316058
 0.0167960 24.806029
 0.0147357 50.773428

10 clusters

plot_g + facet_grid(~k_10, scales = "free_x")

I think these clusters are getting small, and the very specific clusters of <10 samples may make for some wonky analyses.

11 clusters

plot_g + facet_grid(~k_11, scales = "free_x")

Getting into very small clusters and I don’t think we should consider this one.

12 clusters

plot_g + facet_grid(~k_12, scales = "free_x")

These clusters are very very small, so I don’t think this is worth analyzing.

Clean-up

rm(list=ls()[! ls() %in% c("ps_orig","new_sam")])

Saving new ps object

#save.image(here("data", "ps_dmm_6.RData"))